Skip to main content

Data Dumping System

Overview

A Python-based solution for exporting large datasets from MySQL databases with smart batching and Docker containerization.

Features

Current Implementation

  • Supported Dump Types

    • zm: ZM Data Dump (requires date range)
    • mcct: MCCT Data Dump (requires date range)
    • bcu: Daily BCU Data Dumps (single snapshot)
    • vaud: Vaccinator Audit For Health Ministry (single snapshot)
  • Core Functionality

    • 🗓️ Weekly batch processing for date-range dumps
    • 🚰 10k-row streaming using server-side cursors
    • 📦 GZIP compression for output files
    • 📊 Automatic database report download entries

Execution Commands

# Date-range dumps (zm/mcct)
docker compose run --rm dumper --dump {TYPE} {START_DATE} {END_DATE}

# Single-snapshot dumps (bcu/vaud)
docker compose run --rm dumper --dump {TYPE}

Scheduling

** Cron Jobs ** Currently, we're using cron jobs to schedule the BCU dump at 5:01pm every day. Vaccinator Audit dump is also scheduled to run at 10:30am, 1:30pm, and 5:00pm. Both of these dumps handle the date and time range using native SQL DATETIME fucntions. This can be improved.

Docker Integration

# docker-compose.yml excerpt
services:
dumper:
env_file: .env
volumes:
- ./data:/app/data
- ./server_dumps:/mnt/server_dumps
environment:
- QUERY_FILE_ZM=./queries/zm.sql
- QUERY_FILE_MCCT=./queries/mcct.sql

Limitations

Technical Constraints

  1. Parallel Execution
    ❗ Cannot run simultaneous dumps due to:

    • Hardcoded container name
    • Ephemeral container strategy (container is destroyed after each dump task)
  2. Configuration Rigidity
    Adding new dump types requires:

    + code/dumper.py          (Code modifications)
    + docker-compose.yml (ENV variables)
    + queries/new_dump.sql (SQL file)
  3. Date Handling

    • Fixed 7-day batches
    • No time granularity support

Roadmap

Phase 1: Dynamic Configuration

# Proposed config/dumps.yaml
dumps:
zm:
query_file: queries/zm.sql
date_range: required
batch_days: 7
bcu:
query_file: queries/bcu.sql
date_range: prohibited

Phase 2: Parallel Execution Architecture

graph TD
A[API Request] --> B(RabbitMQ)
B --> C[Worker 1] --> E[(MySQL)]
B --> D[Worker 2] --> E[(MySQL)]

Phase 3: Enhanced Filtering

# Proposed time-range support
docker compose run --rm dumper \
--dump zm \
--start "2024-01-01 08:00" \
--end "2024-01-07 18:00" \
--batch-hours 4 \
--batch-rows 10000

Technical Specifications

ComponentTechnologyPurpose
Database DriverPyMySQLMySQL connection management
Batch ProcessingSSCursorServer-side result streaming
CompressiongzipOutput file compression
OrchestrationDocker ComposeContainer lifecycle management
Maintenance Notes
  1. Adding New Dump Types
    Current process requires modifying multiple files. Track progress on dynamic configuration proposal.

  2. Memory Management
    The 10k batch size was tested for:

    • 512MB RAM containers
    • Datasets up to 200M rows